feat: ROOT-11: Support reading JSONL from source cloud storages #7555

Merged
merged 90 commits into from
May 23, 2025

@matt-bernstein matt-bernstein commented May 16, 2025

[x] Factor JSON parsing and validation logic out of the import storages into a function that can be hot-swapped in LSO/LSE
[x] Add that function to settings as an app-specific variable
[x] Replace stdlib parsing with pyarrow parsing to handle JSONL as well as JSON
[x] Add test coverage for JSONL
[x] Add a feature flag
[x] Unskip multitask import tests in LSO, to prepare for testing different behavior in LSE and LSO
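
For readers unfamiliar with the "hot-swappable function in settings" pattern from the first two items, here is a minimal sketch. It is not the code from this PR: the `SETTINGS` dict, the `LOAD_TASKS_JSON` key, and `resolve_callable` are hypothetical names, and plain `importlib` stands in for Django's `import_string` so that the snippet runs without Django.

```python
import importlib
import json
from typing import Any, Callable, Union


def resolve_callable(ref: Union[str, Callable[..., Any]]) -> Callable[..., Any]:
    """Return ref if it is already callable, else import it from a dotted path."""
    if callable(ref):
        return ref
    module_path, _, attr = ref.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


# Hypothetical settings entry. The default points at the stdlib parser;
# another build (for example LSE) could override the value with its own path.
SETTINGS = {"LOAD_TASKS_JSON": "json.loads"}


def load_tasks(blob: bytes) -> Any:
    # Resolve at call time so that an overridden setting takes effect.
    loader = resolve_callable(SETTINGS["LOAD_TASKS_JSON"])
    return loader(blob)
```

With this shape, the storage code only ever calls `load_tasks`, and swapping the parser is a settings change rather than a code change.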

Implementation notes:

  • Had to move computation of row_index and row_group into get_data so that load_tasks_json, which will be swapped in LSO/LSE, has the correct type in all cases. This makes for a pretty large diff.
  • Simplified error messages significantly, since the relevant logic is now shared and deduplicated.
  • Had to keep stdlib json parsing alongside pyarrow instead of removing it entirely:
    • For a single dict, pyarrow can't distinguish it from a one-line JSONL file, so stdlib json is needed to set row_index=None.
    • For a top-level list, pyarrow can't parse it at all, since it assumes each line is an entry in a table. There is an open issue for this.
    • Consider peeking at file extensions to avoid double-parsing (for clarity, not speed); see the TODO comment.
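
The double-parsing order described in these notes can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `parse_tasks` and the `(row_index, task)` tuple shape are hypothetical, the top-level list is assumed to get indices 0..n-1, and the JSONL branch uses a stdlib line-by-line parser in place of pyarrow so the snippet has no extra dependencies.

```python
import json
from typing import Any, List, Optional, Tuple

# (row_index, task); row_index is None when the file holds a single task.
Row = Tuple[Optional[int], Any]


def parse_tasks(blob: bytes) -> List[Row]:
    text = blob.decode("utf-8")
    try:
        # First attempt: treat the whole file as one JSON document.
        value = json.loads(text)
    except json.JSONDecodeError:
        # More than one document in the file: fall back to JSONL.
        # (The PR uses pyarrow here; stdlib is substituted in this sketch.)
        lines = [ln for ln in text.splitlines() if ln.strip()]
        return [(i, json.loads(ln)) for i, ln in enumerate(lines)]
    if isinstance(value, dict):
        # A single dict and a one-line JSONL file look identical,
        # so both are treated as a single task with no row index.
        return [(None, value)]
    if isinstance(value, list):
        # Top-level list: one task per element.
        return list(enumerate(value))
    raise ValueError("expected a JSON object, a JSON list, or JSONL")
```

For example, `parse_tasks(b'{"a": 1}\n{"a": 2}\n')` goes through the JSONL branch and yields row indices 0 and 1, while `parse_tasks(b'{"a": 1}')` yields a single row whose index is `None`.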

matt-bernstein and others added 25 commits May 14, 2025 12:15
…k to task within file of tasks on cloud storage
Co-authored-by: Jo Booth <[email protected]>
@matt-bernstein matt-bernstein requested a review from a team as a code owner May 16, 2025 17:23
@github-actions github-actions bot added the feat label May 16, 2025

matt-bernstein commented May 22, 2025

/fm sync

matt-bernstein commented May 22, 2025

/git merge develop

Successfully merged: 1 file changed, 47 insertions(+), 2 deletions(-)

matt-bernstein commented May 22, 2025

/fm sync

matt-bernstein commented May 22, 2025

/git merge develop

Successfully merged: 5 files changed, 44 insertions(+), 38 deletions(-)

matt-bernstein commented May 23, 2025

/fm sync

@matt-bernstein matt-bernstein merged commit 0669a38 into develop May 23, 2025
52 checks passed
@robot-ci-heartex robot-ci-heartex deleted the fb-ROOT-11 branch May 23, 2025 14:42
3 participants